Skip to content

StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812

Merged
ggreenway merged 78 commits intoenvoyproxy:mainfrom
TAOXUY:fixStatDestructor
Apr 6, 2026
Merged

StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812
ggreenway merged 78 commits intoenvoyproxy:mainfrom
TAOXUY:fixStatDestructor

Conversation

@TAOXUY
Copy link
Copy Markdown
Contributor

@TAOXUY TAOXUY commented Mar 6, 2026

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.

The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.

Fix: we will keep the gauge in the scope cache if it is non-zero

Risk Level: Low

Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.

Docs: NA

Release: NA

Platform Specific Features: no

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY TAOXUY changed the title StatsAccessLogger: StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. Mar 6, 2026
@ggreenway ggreenway self-assigned this Mar 6, 2026
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.

/wait

TAOXUY added 2 commits March 8, 2026 18:09
When evicting unused stats from the central cache, we need to ensure that
gauges actively referenced by components like AccessLogState are not evicted.
The use_count() > 1 check prevents this, but a previous bug in evictUnused
where the lambda parameter std::pair<StatName, T> kv was captured by value
caused artificial inflation of the use_count due to the deep copy. This broke
eviction entirely across the codebase.

This commit fixes evictUnused by taking const auto& kv by reference, avoiding
the deep copy and correctly applying the use_count() > 1 safeguard.

Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State
struct so its active references prevent premature eviction by evictUnused. The
erroneous std::min safeguard during gauge subtractions is also removed as
AccessLogState gauges will no longer be unfairly cleared.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 8, 2026

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.

/wait

Updated with a interface to not evict per metric. We need to keep gauge not evicted in the scope as that it can be looked-up and then dec/inc on the same gauge. @kyessenov

TAOXUY added 4 commits March 8, 2026 21:12
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 9, 2026

/retest

@ggreenway
Copy link
Copy Markdown
Member

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components.

@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 11, 2026

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components.

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

Imagine when a gauge is incremented and then evicted before decremented, there is another
there is another accesslog accessing the same gauge using the same name and doing inc/dec, the value would be corrupted.

@kyessenov

@ggreenway
Copy link
Copy Markdown
Member

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

The central store is the Store, and all scopes reference the same store. Anytime you get a metric from the scope, if the scope does not already have it, it looks in the store, so it is not possible for two scopes to have different metrics with the same name/tags.

That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and it's current value from being removed from the Store.

Imagine when a gauge is incremented and then evicted before decremented, there is another there is another accesslog accessing the same gauge using the same name and doing inc/dec, the value would be corrupted.

In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 11, 2026

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution; it's much cleaner and clearer.

@kyessenov can you also review this, especially the change to eviction logic?

TAOXUY added 3 commits March 11, 2026 22:22
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Apr 1, 2026

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just a couple small things

/wait

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Signed-off-by: Xuyang Tao <taoxuy@google.com>
TAOXUY added 2 commits April 2, 2026 23:36
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Comment on lines +260 to +262
// We no longer check for the log line here because only 1 test variant will see the log line
// due to log throttling in periodic logs as IPv4 and IPv6 run in the same process.
// stats_test.cc has an EXPECT_LOG_CONTAINS for this situation already.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ggreenway
Copy link
Copy Markdown
Member

You forgot the DCO signoff on the last commit. Please fix; you'll have to force-push

Removed redundant log line check due to log throttling.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY TAOXUY force-pushed the fixStatDestructor branch from fed339f to babe799 Compare April 6, 2026 15:34
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Apr 6, 2026

You forgot the DCO signoff on the last commit. Please fix; you'll have to force-push

Thx for the reminder. Done!

@ggreenway ggreenway enabled auto-merge (squash) April 6, 2026 15:48
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Apr 6, 2026

/retest

@ggreenway ggreenway merged commit a8c2080 into envoyproxy:main Apr 6, 2026
28 checks passed
nshipilov pushed a commit to nshipilov/envoy that referenced this pull request Apr 13, 2026
…ementing metrics after Scope evictions. (envoyproxy#43812)

The original code correctly attempted to prevent "zombie" gauges by
re-resolving metrics against the central store (via
scope_->gaugeFromStatNameWithTags) during request destruction. However,
it tried to reconstruct the gauge's identity using
gauge_->tagExtractedStatName(). This failed because dynamic access-log
tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global
extractors. The extraction process returned a mangled base name and
empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1
from it immediately crashed Envoy with a counter underflow.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Nick Shipilov <nick.shipilov.n@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants